Exploring the Synergy of Humans and Machines in Extreme Video Retrieval

Authors

Alexander G. Hauptmann

Wei-Hao Lin

Rong Yan

Jun Yang

Robert V. Baron

Ming-yu Chen

Sean Gilroy

Michael D. Gordon

Abstract

We introduce an interface for efficient video search that exploits the human ability to quickly scan visual content after an automatic system has done its best to arrange the images in order of relevance. While extreme video retrieval is taxing for the human, it has also been shown to be extremely effective. Two variants of extreme retrieval are demonstrated: 1) RSVP, which automatically pages through images while the user controls the page speed and marks the relevant ones, and 2) MBRP, where the user manually controls the paging and can also adjust the number of images per page, depending on the density of relevant shots found. Pros and cons of each variant are discussed.

1. Interactive vs. Automatic Video Search

When comparing results of fully automated video retrieval to interactive video retrieval [5], one finds a big gap in performance. The fully automated search (no user in the loop) achieves good recall for many topics, but relevant shots tend to be distributed throughout the top 3000 to 5000 slots in the ordered shot list, causing the standard metric of average precision for automated search to lag well behind most interactive runs. From this insight, we developed an interface that relies on superior human visual perception to compensate for low precision in automatic search of the visual contents of video [1]. The human user can filter the best automatically generated results and produce a better set that retains the relevant shots, resulting in much greater precision. We named this approach extreme video retrieval (XVR), as it combines the best machine performance with maximal use of human perception skills. Our interface explores two approaches to human filtering: rapid serial visual presentation, and manually controlled browsing with resizing of pages. The success of XVR relies heavily on the ability of automatic retrieval systems to recall as many relevant shots as possible at as shallow a depth as possible.

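
To make the role of recall at shallow depth concrete, the following minimal sketch (not from the paper; the function names and the toy relevance list are illustrative assumptions) computes TREC-style average precision truncated at depth k. It also shows that if a user could perfectly filter the top k shots, so that every relevant shot found there moves to the front of the list, average precision would equal recall at depth k.

```python
def average_precision(relevance, total_relevant, depth):
    """Average precision over the top `depth` ranks.

    relevance: list of 0/1 flags in ranked order (1 = relevant shot).
    total_relevant: number of relevant shots in the whole collection.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:depth], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / total_relevant if total_relevant else 0.0


def recall_at(relevance, total_relevant, depth):
    """Fraction of all relevant shots that appear in the top `depth` ranks."""
    return sum(relevance[:depth]) / total_relevant if total_relevant else 0.0


def perfectly_filtered(relevance, depth):
    """Idealized human filter: relevant shots in the top `depth` move to the front."""
    return sorted(relevance[:depth], reverse=True)


if __name__ == "__main__":
    # Toy ranked list: 3 of the 4 relevant shots are retrieved, but scattered.
    ranked = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
    total_relevant = 4
    k = len(ranked)
    print("automatic AP:", average_precision(ranked, total_relevant, k))
    print("recall at k:", recall_at(ranked, total_relevant, k))
    print("filtered AP:", average_precision(perfectly_filtered(ranked, k), total_relevant, k))
```

In this idealized setting, the only way to raise the filtered score further is to look deeper into the ranked list, which is why recall at depth matters more than the machine's precision.
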
To study the machine extremes of our automatic retrieval system, we take one automatic run [3,4] with query classes and plot MAP over 24 TRECVID 2005 search topics against the depth k of shots, as shown in Figure 1. The automatic run demonstrates respectable performance, achieving MAP of around 0.1 at the depth of 1000 shots commonly chosen in TRECVID. Beyond a depth of 1000 shots, MAP reaches a plateau, mainly due to the severe penalty for ranking relevant shots low in the calculation of average precision. However, with an optimal re-ranking function, the optimal curve becomes the recall at depth k, and clearly our automatic retrieval system has decent recall. For easy comparison we also plot the best performance of all search submissions in TRECVID 2005. The results show that anyone who can browse through the top 2000 shots (merely 2.56% of the TRECVID 2005 test set) for each topic could have achieved the best search performance in TRECVID 2005, and even better performance by looking deeper or faster.

Fig. 1. The MAPs over 24 TRECVID 2005 search topics of one CMU automatic run, the best interactive run in TRECVID 2005, and a hypothetical run with an optimal re-ranking function.

2. Human Extremes – RSVP

Rapid Serial Visual Presentation (RSVP) is a technique of rapidly presenting a series of images, and has been widely used in visualization and psychophysics experiments [2]. The basic version of RSVP, known as the keyhole mode [2], presents a sequence of images in the same position on the screen, where each image replaces the previous one every n milliseconds; n is thus the interval between two images. Users can vary the presentation speed (adding or subtracting 100 ms from n) with two keys, A (advance) and S (slow). When a relevant image is shown on the screen, users press the J key to mark the current image plus the previous image, because of the human reaction-time delay between the presentation of the relevant image and the human motor response.

Since two images are marked for each relevant key press, a second, correction phase is needed to carefully page through all marked images and validate the judgments. In TRECVID 2005, we submitted one complete run using this variable-speed keyhole RSVP interface, which ranked 4th among all TRECVID 2005 interactive runs [4]. The 24 topics were completed over three consecutive days, with 4 topics in the morning session and 4 topics in the afternoon session. Before each session, one topic from 2004 was used as practice to “warm up” the participant. We found that subjects can correct around 100 images per minute in the correction phase, and thus the length of the correction phase was dynamically determined by the program based on the number of relevant shots already marked in the first phase and the total available time.

While no other existing video retrieval system uses RSVP, several reasons argue for it:

1) RSVP is an interface specifically designed to present images rapidly, which matches the human capability to react quickly to visual stimuli.
2) Keyhole mode requires no eye movements and therefore optimizes the time a user looks at an image. More complex displays such as grids or collages demand eye movements and extra time for eye fixation on every image.
3) RSVP automatically advances to the next image in the sequence without manual paging, which reduces the user's cognitive load by eliminating extra key presses for each following display.
4) Variable speed control allows users to adjust the presentation speed. If we take the first derivative of the optimal curve in Figure 1, we note that the rate of relevant images is not constant: there are more relevant shots in the early, top ranks than later. Thus it makes sense to use slower speeds for the top-ranked shots, reducing the chance of missing relevant shots, and to speed up for later, lower-ranked results. Variable speed also allows users to slow down for a break when their attention is failing.

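
The RSVP mechanics described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the assumption that A shortens the interval while S lengthens it, the lower bound on the interval, and the way the correction budget is capped by the remaining time are all assumptions made for the example.

```python
STEP_MS = 100                  # speed change per key press, as stated in the text
CORRECTIONS_PER_MINUTE = 100   # approximate correction rate reported in the text


def adjust_interval(interval_ms, key, min_ms=100):
    """Return the new display interval after an A (faster) or S (slower) key press.

    The lower bound `min_ms` is an assumption; the text does not give one.
    """
    if key == "A":
        return max(min_ms, interval_ms - STEP_MS)
    if key == "S":
        return interval_ms + STEP_MS
    return interval_ms


def mark_on_keypress(marked, current_index):
    """Mark the current image and the previous one to absorb reaction-time delay."""
    marked.add(current_index)
    if current_index > 0:
        marked.add(current_index - 1)
    return marked


def correction_minutes(num_marked, minutes_left):
    """Estimate time to reserve for the correction phase, capped by the time left."""
    needed = num_marked / CORRECTIONS_PER_MINUTE
    return min(needed, minutes_left)


if __name__ == "__main__":
    interval = adjust_interval(300, "A")   # user speeds up
    marked = mark_on_keypress(set(), 42)   # J pressed while image 42 is shown
    print(interval, sorted(marked), correction_minutes(len(marked), 15))
```

Because every key press marks two images, the marked set grows at twice the rate of key presses, which is what makes the correction budget a real constraint within a fixed session.
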
A second TRECVID 2005 RSVP submission displayed two images simultaneously on each page. Each key press then marked both images on the current page, as well as the two images on the previous page, as relevant, requiring four images to be verified in the validation/correction phase. Since so many images were marked, subjects were frequently unable to correct all images selected during the initial RSVP phase, resulting in lower mean average precision.

Manual Browsing in XVR

Fig. 2. Manual browsing with different page layouts: 1x2 at the beginning, 2x2 in a later stage, and 3x3 for the rest of the shots. The green bounding boxes indicate the shots labeled relevant, and the keyboard section below a page shows the keys for labeling the respective shots.

Manual Browsing with Resizing Pages (MBRP) is a strategy for interactive search which, unlike RSVP, where the same number of shots per page is used throughout the search, allows adapting the page size according to the (decreasing) percentage of relevant shots. At the beginning, when relevant shots are frequent, we use a small page size, since multiple relevant shots are likely on one page, which demands more attention (per image) and key presses to label them. Later, when relevant shots become infrequent, large page sizes become efficient, since it is unlikely that multiple relevant shots will appear even on a large page. MBRP thus reduces the overhead of page turning and the number of necessary key presses for relevant images on a page. Since the time a user spends browsing each page depends on the page size, the visual complexity of the answer, and the number of correct shots, this time can vary dramatically between pages. The user may occasionally need to turn back to previous pages to correct erroneous labels. MBRP thus gains an advantage by letting users turn pages with forward and backward keys.

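
A minimal sketch of how a ranked shot list could be cut into pages of growing size is shown below. In MBRP the user changes the layout manually, so the fixed schedule of stages used here, along with the function name and its parameters, is purely an illustrative assumption.

```python
def paginate(shots, stages, final_layout=(3, 3)):
    """Split a ranked shot list into pages whose size grows over time.

    stages: list of ((rows, cols), page_count) pairs applied in order,
            e.g. [((1, 2), 10), ((2, 2), 20)]. These values are hypothetical;
            the text does not specify when the layout changes.
    final_layout: layout used for all remaining shots.
    """
    pages = []
    pos = 0
    for (rows, cols), page_count in stages:
        size = rows * cols
        for _ in range(page_count):
            if pos >= len(shots):
                return pages
            pages.append(shots[pos:pos + size])
            pos += size
    size = final_layout[0] * final_layout[1]
    while pos < len(shots):
        pages.append(shots[pos:pos + size])
        pos += size
    return pages


if __name__ == "__main__":
    ranked_shots = list(range(30))
    # Two 1x2 pages, then two 2x2 pages, then 3x3 for the rest.
    for page in paginate(ranked_shots, [((1, 2), 2), ((2, 2), 2)]):
        print(len(page), page)
```

Backward paging then amounts to keeping an index into the resulting list of pages and decrementing it, so earlier labels can be revisited without recomputing the layout.
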
Unlike RSVP, where a single key is used to mark all shots on a page, MBRP allows up to 16 keys (in a 4x4 layout on the keyboard) for labeling 16 shots simultaneously, with one key corresponding to each presented image. Moreover, another key can be used to label all the shots on the current page and automatically turn the page. Although a page can use any layout of images (e.g., 3x3, 2x5, etc.), we use only 1x2, 2x2, and 3x3, for two reasons. First, with practice, one hand can conveniently label any shot(s) in layouts up to 3x3, but not more than 9 shots per page. Second, visually inspecting more than 9 shots per page is less time-efficient.

As the user must label as many shots as possible in a fixed time, errors are inevitable due to time pressure. While missed relevant shots cannot be recovered during the verification phase, usually one or two minutes are used to correct false-alarm errors. In addition, if the user is unsure about the relevance of a shot, it can be marked as “maybe”; all “maybe” shots are sorted after those marked as “relevant”.

A TRECVID 2005 submission using MBRP averaged about 2000 shots viewed within the 15 minutes for each topic. Typically, this number is higher for queries that are easily identifiable visually, and lower otherwise. For example, for the query “tennis”, a user could browse almost 5,000 shots in the allocated 15 minutes. The MBRP run achieved a mean average precision (MAP) of 0.408 on the TRECVID 2005 evaluation, which ranked second among a total of 50 interactive submissions and was only marginally behind the best run (MAP = 0.414). It also outperformed the submission using the best RSVP method (MAP = 0.366).
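
The ordering of “relevant” and “maybe” labels in the final result list can be sketched with a stable sort. The text only states that “maybe” shots follow “relevant” ones; placing unlabeled shots last in their original machine order, as well as the names used here, are assumptions of this example.

```python
# Lower value = earlier in the final list. Unlabeled shots get priority 2.
PRIORITY = {"relevant": 0, "maybe": 1}


def final_ranking(ranked_shots, labels):
    """Reorder shots: 'relevant' first, then 'maybe', then everything else.

    ranked_shots: shot ids in the automatic system's order.
    labels: dict mapping shot id -> "relevant" or "maybe" (user judgments).
    Python's sort is stable, so the machine order is kept within each group.
    """
    return sorted(ranked_shots, key=lambda shot: PRIORITY.get(labels.get(shot), 2))


if __name__ == "__main__":
    machine_order = ["s1", "s2", "s3", "s4", "s5"]
    user_labels = {"s2": "maybe", "s4": "relevant", "s5": "relevant"}
    print(final_ranking(machine_order, user_labels))
```
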

Publication date: 2006
